Research Synthesis Methods
Wiley
All preprints, ranked by how well they match Research Synthesis Methods' content profile, based on 20 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Janoudi, G.; Uzun, M.; Jurdana, M.; Hutton, B.
Background: Systematic literature reviews (SLRs) are essential for evidence synthesis but are hampered by the resource-intensive full-text screening phase. Loon Lens Pro, a publicly available agentic AI tool, automates full-text screening without prior training by using user-defined inclusion/exclusion criteria and multiple specialized AI agents. This study validated Loon Lens Pro against human reviewers to assess its accuracy, efficiency, and confidence scoring in screening. Methods: In this comparative validation study, 84 full-text articles from eight SLRs were screened by both Loon Lens Pro and human reviewers (gold standard). The AI provided binary inclusion/exclusion decisions along with a transparent rationale and confidence ratings (low, medium, high). Performance metrics, including accuracy, sensitivity, specificity, negative predictive value, precision, and F1 score, were derived from a confusion matrix. Logistic regression with bootstrap resampling (1,000 iterations) evaluated the association between confidence scores and screening errors. Results: Loon Lens Pro correctly classified 70 of 84 full texts, achieving an accuracy of 83.3% (95% CI: 75.0-90.5%), sensitivity of 94.7% (95% CI: 82.4-100%), and specificity of 80.0% (95% CI: 70.1-89.2%). The negative predictive value was 98.1% (95% CI: 93.8-100%), with a precision of 58.1% (95% CI: 41.4-76.0%) and an F1 score of 0.72. Logistic regression revealed a strong inverse relationship between confidence level and error probability: low, medium, and high confidence decisions were associated with predicted error probabilities of 46.9%, 30.9%, and 3.5%, respectively (C-index = 0.87). Conclusion: Our study provides evidence that Loon Lens Pro is a viable and effective tool for automating the full-text screening phase of systematic reviews. Its high sensitivity, robust confidence scoring mechanism, and transparent rationale generation collectively support its potential to alleviate the burden of manual screening without compromising the quality of study selection.
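For readers who want to see how these figures fit together, the sketch below rebuilds the reported screening metrics from a 2x2 confusion matrix. The cell counts are an assumption reverse-engineered from the abstract's numbers (70 of 84 correct, sensitivity 94.7%, specificity 80.0%), not data from the paper.

```python
# Minimal sketch: deriving the screening metrics reported above from a 2x2
# confusion matrix. The counts (TP=18, FN=1, FP=13, TN=52) are an assumption
# consistent with the reported figures, not the paper's raw data.

TP, FN, FP, TN = 18, 1, 13, 52  # human decision = gold standard

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 70/84 = 0.833
sensitivity = TP / (TP + FN)                    # 18/19 = 0.947
specificity = TN / (TN + FP)                    # 52/65 = 0.800
precision   = TP / (TP + FP)                    # 18/31 = 0.581
npv         = TN / (TN + FN)                    # 52/53 = 0.981
f1          = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.72

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f} "
      f"NPV={npv:.3f} F1={f1:.2f}")
```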
Bracchiglione, J.; Meza, N.; Lunny, C.; Pieper, D.; Madrid, E.; Urrutia, G.; Bonfill Cosp, X.
Introduction: Overlap of primary studies among systematic reviews (SRs) included in an overview is a major challenge, as it may bias results or artificially increase the precision of the synthesis. Matrices of evidence and corrected covered area (CCA) calculation are recommended methods to manage overlap, but there is little guidance on how to construct these matrices. This research aims to explore variations in the estimation of overlap using CCA matrices under different assumptions. Methods: We will include overviews published in 2023. We will describe the methods used by authors to deal with overlap, and we will calculate a summary CCA (a CCA for the whole matrix of evidence) and a pairwise CCA (a CCA for each possible pair of included SRs), comparing the results under different assumptions that may modify the evidence matrix and thus the CCA. These assumptions include: publication-thread adjustments (i.e. the consideration of each set of references regarding a single primary study as a unique row in a matrix of evidence), scope adjustments (i.e. the consideration only of the SRs and primary studies providing useful data for a given outcome within an overview) and chronological structural missingness adjustments (i.e. the exclusion of primary studies published after a given SR for purposes of CCA calculation). We will assess overlap at an overview and outcome level. Discussion: We propose clear definitions for the key assumptions for creating matrices of evidence. We expect to provide a guide for overview authors to better interpret their CCA estimations. Article Summary (Strengths and limitations of this study):
- This protocol explores the assumptions underlying the overlap assessment in overviews of systematic reviews, which so far have not been explicitly addressed.
- These assumptions include scope adjustments, publication-thread adjustments, structural missingness adjustments, and analysis at an overview or outcome level.
- We provide clear definitions for key overlap concepts that will guide authors in making their overlap assessments more explicit when using a matrix of evidence or corrected covered area approach.
- We plan to conduct exploratory analyses under different assumptions in a purposive sample of overviews; hence, we will not comprehensively include all the overviews in the study period.
- We will conduct all analyses calculating the corrected covered area for the whole matrices (overall approach) and for every possible pair of systematic reviews within each matrix (pairwise approach).
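The corrected covered area the protocol refers to has a simple closed form. Below is a minimal sketch, on a made-up inclusion matrix, of the summary and pairwise CCA calculations; the matrix and the helper function are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of corrected covered area (CCA) on a toy inclusion matrix.
# Rows = primary studies, columns = systematic reviews; 1 = study included.
import itertools
import numpy as np

M = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])

def cca(matrix):
    """CCA = (N - r) / (r*c - r), where N is the number of inclusions,
    r the number of primary studies (rows) and c the number of reviews."""
    m = np.asarray(matrix)
    m = m[m.any(axis=1)]          # drop studies not covered by any review
    r, c = m.shape
    N = m.sum()
    return (N - r) / (r * c - r)

print(f"summary CCA: {cca(M):.2%}")

# Pairwise CCA for every pair of reviews (c = 2 in the formula above)
for i, j in itertools.combinations(range(M.shape[1]), 2):
    print(f"reviews {i} vs {j}: CCA = {cca(M[:, [i, j]]):.2%}")
```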
Janoudi, G.; Rada (Uzun), M.; Jurdana, M.; Fuzul, E.; Ivkovic, J.
Introduction: Systematic literature reviews (SLRs) are critical for informing clinical research and practice, but they are time-consuming and resource-intensive, particularly during Title and Abstract (TiAb) screening. Loon Lens, an autonomous, agentic AI platform, streamlines TiAb screening without the need for human reviewers to conduct any screening. Methods: This study validates Loon Lens against human reviewer decisions across eight SLRs conducted by Canada's Drug Agency, covering a range of drugs and eligibility criteria. A total of 3,796 citations were retrieved, with human reviewers identifying 287 (7.6%) for inclusion. Loon Lens autonomously screened the same citations based on the provided inclusion and exclusion criteria. Metrics such as accuracy, recall, precision, F1 score, specificity, and negative predictive value (NPV) were calculated. Bootstrapping was applied to compute 95% confidence intervals. Results: Loon Lens achieved an accuracy of 95.5% (95% CI: 94.8-96.1), with recall at 98.95% (95% CI: 97.57-100%) and specificity at 95.24% (95% CI: 94.54-95.89%). Precision was lower at 62.97% (95% CI: 58.39-67.27%), suggesting that Loon Lens included more citations for full-text screening compared to human reviewers. The F1 score was 0.770 (95% CI: 0.734-0.802), indicating a strong balance between precision and recall. Conclusion: Loon Lens demonstrates the ability to autonomously conduct TiAb screening with substantial potential for reducing the time and cost associated with manual or semi-autonomous TiAb screening in SLRs. While improvements in precision are needed, the platform offers a scalable, autonomous solution for systematic reviews. Access to Loon Lens is available upon request at https://loonlens.com/.
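The bootstrap confidence intervals mentioned in the Methods can be illustrated in a few lines. The sketch below resamples citations with replacement and recomputes recall and precision each time; the synthetic decision vectors (and the 7.6% inclusion rate used to generate them) are placeholders, not the study data.

```python
# Minimal sketch of bootstrap 95% CIs for recall and precision.
# y_true / y_pred are synthetic placeholders, not the 3,796 screened citations.
import numpy as np

rng = np.random.default_rng(0)
n = 3796
y_true = rng.random(n) < 0.076                              # ~7.6% human inclusions
y_pred = np.where(rng.random(n) < 0.98, y_true, ~y_true)    # noisy AI decisions

def recall(t, p):    return (t & p).sum() / t.sum()
def precision(t, p): return (t & p).sum() / p.sum()

boot = []
for _ in range(1000):                     # resample citations with replacement
    idx = rng.integers(0, n, n)
    boot.append((recall(y_true[idx], y_pred[idx]),
                 precision(y_true[idx], y_pred[idx])))
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print(f"recall 95% CI: {lo[0]:.3f}-{hi[0]:.3f}, "
      f"precision 95% CI: {lo[1]:.3f}-{hi[1]:.3f}")
```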
Davis, R. C.; List, S. S.; Chappell, K. G.; Heen, E.
Systematic reviewing is a time-consuming process that can be aided by artificial intelligence (AI). There are several AI options to assist with title/abstract screening; however, options for full-text screening (FTS) are limited. The objective of this study was to evaluate the reliability of a custom GPT (cGPT) for FTS. A cGPT powered by OpenAI's ChatGPT-4o was trained and tested with a subset of articles assessed in duplicate by human reviewers. Outputs from the testing subset were coded to simulate cGPT as an autonomous and assistant reviewer. Cohen's kappa was used to assess interrater agreement. The threshold for practical use was defined as a cGPT-human kappa score exceeding the lower bound of the confidence interval (CI) for the lowest human-human kappa score in inclusion/exclusion and exclusion-reason decisions. cGPT as an assistant reviewer met this reliability threshold. With the Cohen's kappa CIs for human-human pairs ranging from 0.658 to 1.00 in the inclusion/exclusion decision, assistant cGPT-human kappa scores were encompassed in two of four pairings. In exclusion-reason classification, the benchmark human-human kappa score CI range was 0.606 to 0.912. Assistant cGPT-human kappa scores were encompassed in one of four pairings. cGPT as an autonomous reviewer did not meet reliability thresholds. cGPT as an assistant could speed up systematic reviewing in a sufficiently reliable way. More research is needed to establish standardized thresholds for practical use. While the current study dealt with physiological population parameters, cGPTs can assist in FTS of systematic reviews in any field. Highlights:
- There are several AI options to assist in title/abstract screening in systematic reviewing; however, options for full-text screening are limited.
- The reliability of a tailor-made AI model in the form of a custom GPT was explored in the role of an assistant to a human reviewer and as an autonomous reviewer.
- Interrater agreement was sufficient when the model operated in the role of assistant reviewer but not in the role of autonomous reviewer. Here the model misclassified two articles out of ten, whereas the human reviewers erred in approximately one out of ten articles.
- The study shows that it is possible to craft a custom GPT as a useful assistant in systematic reviews. cGPTs can be crafted to assist in reviews in any field.
- An automated setup for inputting articles and coding cGPT responses is needed to maximize the potential time-saving benefit.
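Cohen's kappa, the agreement statistic used throughout this abstract, can be computed directly with scikit-learn. The sketch below compares a human-human benchmark pairing with a cGPT-human pairing on toy inclusion/exclusion decisions; the decision vectors are illustrative assumptions.

```python
# Minimal sketch of the Cohen's kappa comparison described above.
# The decision vectors are illustrative placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

human_a = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
human_b = ["include", "exclude", "include", "include", "exclude", "exclude"]
cgpt    = ["include", "exclude", "exclude", "include", "include", "exclude"]

# Human-human benchmark vs. cGPT-human agreement on inclusion/exclusion decisions
print("human-human kappa:", cohen_kappa_score(human_a, human_b))
print("cGPT-human kappa: ", cohen_kappa_score(human_a, cgpt))
```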
Vidor, P. R.; Casiraghi, Y.; de Souza, A. M.; Schmidt, M. I.
Background: Bias assessment is a crucial step in evaluating evidence from randomized controlled trials. The widely adopted Cochrane RoB 2, designed to identify these issues, is complex, resource-intensive, and unreliable. Advances in artificial intelligence (AI), particularly in the field of large language models (LLMs), now allow the automation of complex tasks. While prior investigations have focused on whether LLMs could perform assessments with RoB 2, integrating technologies does not resolve the intrinsic methodological issues of the instrument. This is the first feasibility study to evaluate the reliability of ROBUST-RCT, a novel bias assessment tool, as applied by humans and LLMs. Methods: A sample of RCTs of drug interventions was screened for eligibility. Reviewers working independently used ROBUST-RCT to assess different aspects of the studies and then reached a consensus through discussion. A chain-of-thought prompt instructed four LLMs on how to apply ROBUST-RCT. The primary analysis used Gwet's AC2 coefficient and benchmarking to assess inter-rater reliability of the "judgment set", defined as the series of final assessments for the six core items in the ROBUST-RCT tool. Results: 54 assessments of each LLM were compared to human consensus in the primary analysis. Gwet's AC2 inter-rater reliability ranged from 0.46 to 0.69. With 95% confidence, three of the four tested LLMs achieved moderate or higher reliability based on probabilistic benchmarking. A secondary analysis also found a Fleiss kappa of 0.49 (95% CI: 0.30-0.60) between human reviewers before consensus, numerically higher than the values reported in prior literature about RoB 2. Conclusion: Large language models (LLMs) can effectively perform risk-of-bias assessments using the ROBUST-RCT tool, enabling their integration into future systematic review workflows aiming for enhanced objectivity and efficiency.
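Gwet's coefficients are less widely implemented than kappa, so a brief sketch may help. The code below computes the unweighted AC1 for two raters; the weighted AC2 variant used in the study (which adds ordinal category weights) is not reproduced here, and the rating vectors are illustrative assumptions.

```python
# Minimal sketch of a Gwet-style chance-corrected agreement coefficient.
# This is the unweighted AC1 for two raters over K categories; the weighted
# AC2 used in the study is more involved. Ratings are illustrative placeholders.
from collections import Counter

def gwet_ac1(r1, r2):
    cats = sorted(set(r1) | set(r2))
    n, k = len(r1), len(cats)
    pa = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    # pi_k: mean proportion of items the raters assign to category k
    c1, c2 = Counter(r1), Counter(r2)
    pi = {c: (c1[c] + c2[c]) / (2 * n) for c in cats}
    pe = sum(p * (1 - p) for p in pi.values()) / (k - 1)  # chance agreement
    return (pa - pe) / (1 - pe)

llm    = ["low", "low", "probably low", "high", "low", "probably high"]
humans = ["low", "probably low", "probably low", "high", "low", "high"]
print(f"AC1 = {gwet_ac1(llm, humans):.2f}")
```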
Wilkinson, J. D.; Heal, C.; Flemyng, E.; Antoniou, G. A.; Aburrow, T.; Alfirevic, Z.; Avenell, A.; Barbour, V.; Berghella, V.; Bishop, D. V.; Bordewijk, E. M.; Brown, N. J.; Christopher, J.; Clarke, M.; Dahly, D. L.; Dennis, J.; Dicker, P.; Dumville, J.; Frankish, H.; Grey, A.; Grohmann, S.; Gurrin, L. C.; Hayden, J. A.; Heathers, J. A.; Hunter, K. E.; Hussey, I.; Jung, L.; Lam, E.; Lasserson, T. J.; Lensen, S.; Li, T.; Li, W.; Liu, J.; Loder, E.; Lundh, A.; Meyerowitz-Katz, G.; Mol, B. W.; Naudet, F.; Noel-Storr, A.; O'Connell, N.; Parker, L.; Redberg, R. F.; Redman, B. K.; Richardson, R.; Se
Précis: The integrity of evidence synthesis is threatened by problematic randomised controlled trials (RCTs). These are RCTs where there are serious concerns about the trustworthiness of the data or findings. This could be due to research misconduct, including fraud, or due to honest critical errors. If these RCTs are not detected, they may be inadvertently included in systematic reviews and guidelines, potentially distorting their results. To address this problem, the INSPECT-SR (INveStigating ProblEmatic Clinical Trials in Systematic Reviews) tool has been developed to assess the trustworthiness of RCTs. This will allow problematic RCTs to be identified and excluded from systematic reviews. This paper describes the development of INSPECT-SR. The tool and an associated guidance document are presented.
Henry, M.; O'Connell, N.; Riley, R.; Moons, K.; Shea, B.; Hooft, L.; Wallwork, S.; Damen, J.; Skoetz, N.; Appiah, R.; Berryman, C.; Crouch, S.; Ferencz, G.; Grant, A.; Henry, K.; Herman, A.; Karran, E.; Koralegedera, I.; Leake, H.; MacIntyre, E.; Mouatt, B.; Phuentsho, K.; Van Der Laan, D.; Welsby, E.; Wiles, L.; Wilkinson, E.; Wilson, M.; Wilson, M.; Moseley, L.
Background: This paper details initial testing of the agreeability and usability of a novel quality appraisal tool for systematic reviews of prognostic factor studies: AMSTAR-PF. Methods: Fourteen appraisers each assessed eight systematic reviews using AMSTAR-PF. Their ratings for each question and each article were compared, with interrater, inter-pair and intrapair agreeability calculated using Gwet's agreement coefficient. Time of use and time to reach consensus were also recorded. Results: Interrater agreement averaged 0.59 (range 0.21-0.90), inter-pair 0.61 (range 0.24-0.91) and intrapair 0.75 (range 0.45-0.95) across the domains, with agreement for the overall rating 0.46 (95% CI 0.30-0.62) for interrater, 0.46 (95% CI 0.17-0.74) for inter-pair, and 0.68 (range of averages 0.22-1.00) for intrapair agreement. The majority (60.7%) of intrapair ratings were identical, with 94.6% of final ratings either identical or only one category different for the overall appraisal. The time taken to appraise a study with AMSTAR-PF improved with use and averaged around 34 minutes after the first two appraisals. Conclusions: Despite some variance in agreeability for different domains and between different appraisers, the testing results suggest that AMSTAR-PF has clear utility for appraising the quality of systematic reviews of prognostic factor studies.
Heston, T. F.
Clinical studies commonly report p-values but rarely quantify how stable those p-values are or how far the observed data lie from the point representing no effect. This study introduces a unified framework that evaluates statistical significance, fragility, and neutrality distance across three standard clinical data structures: single-arm binomial outcomes, two-arm binary outcomes, and continuous two-group outcomes. The objective was to determine whether reporting these three components together can improve the interpretation of clinical research results. Using previously published summary statistics, we calculated significance, fragility, and neutrality distance for representative examples from each design category. The framework applies the diagnostic fragility quotient and a proportion-based neutrality measure for single-arm benchmarks; the global fragility quotient and risk quotient for two-arm binary outcomes; and the continuous fragility scale and meaningful change index for mean comparisons. Across all examples, the triplet revealed patterns that were not detectable with p-values or effect sizes alone. Some statistically significant findings were highly fragile or close to neutrality despite appearing reliable. At the same time, some non-significant results showed meaningful separation from the no-effect state despite stable p-values. These findings highlight how statistical significance, decision stability, and distance from neutrality represent distinct dimensions of evidence that can diverge in clinically important ways. This triplet provides a concise, generalizable summary of evidence quality that enhances transparency and reduces misinterpretation across a broad range of study designs.
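As a rough illustration of the fragility idea invoked here, the sketch below computes a conventional fragility index and fragility quotient for a two-arm binary outcome; the paper's diagnostic and global fragility quotients and neutrality measures differ in detail, and the trial counts are invented.

```python
# Minimal sketch of a fragility calculation for a two-arm binary outcome:
# how many event-status changes in the smaller-event arm flip the Fisher
# exact p-value across 0.05. Counts are illustrative, not from the paper.
from scipy.stats import fisher_exact

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    flips = 0
    significant = fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])[1] < alpha
    while ((fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])[1] < alpha) == significant
           and e1 < n1):
        e1 += 1          # convert one non-event to an event in arm 1
        flips += 1
    return flips

e1, n1, e2, n2 = 5, 100, 18, 100      # illustrative trial: 5/100 vs 18/100 events
fi = fragility_index(e1, n1, e2, n2)
print(f"fragility index = {fi}, fragility quotient = {fi / (n1 + n2):.3f}")
```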
Disher, T.; Janoudi, G.; Uzun, M.
Background: Title and abstract (TiAb) screening in systematic literature reviews (SLRs) is labor-intensive. While agentic artificial intelligence (AI) platforms like Loon Lens 1.0 offer automation, lower precision can necessitate increased full-text review. This study evaluated the calibration of Loon Lens 1.0's confidence ratings to prioritize citations for human review. Methods: We conducted a post-hoc analysis of citations included in a previous validation of Loon Lens 1.0. The data set consists of records screened by both Loon Lens 1.0 and human reviewers (gold standard). A logistic regression model predicted the probability of discrepancy between Loon Lens and human decisions, using Loon Lens confidence ratings (Low, Medium, High, Very High) as predictors. Model performance was assessed using bootstrapping with 1,000 resamples, calculating optimism-corrected calibration, discrimination (C-index), and diagnostic metrics. Results: Low and Medium confidence citations comprised 5.1% of the sample but accounted for 60.6% of errors. The logistic regression model demonstrated excellent discrimination (C-index = 0.86) and calibration, accurately reflecting observed error rates. "Low" confidence citations had a predicted probability of error of 0.65 (95% CI: 0.56-0.74), decreasing substantially with higher confidence: 0.38 (95% CI 0.28-0.49) for "Medium", 0.05 (95% CI 0.04-0.07) for "High", and 0.01 (95% CI 0.007-0.01) for "Very High". Human review of "Low" and "Medium" confidence abstracts would improve overall precision from 62.97% to 81.4% while maintaining high sensitivity (99.3%) and specificity (98.1%). Conclusions: Loon Lens 1.0's confidence ratings show good calibration when used as the basis for a model predicting the probability of a screening error. Targeted human review significantly improves precision while preserving recall and specificity. This calibrated model offers a practical strategy for optimizing human-AI collaboration in TiAb screening, addressing the challenge of lower precision in automated approaches. Further research is needed to assess generalizability across diverse review contexts.
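The calibration model described above is essentially a logistic regression of error on the ordinal confidence rating. The sketch below fits such a model on synthetic data sized to roughly mimic the reported error rates; the counts per confidence level are assumptions, not the study records.

```python
# Minimal sketch: logistic regression of screening error (disagreement with the
# human gold standard) on the confidence rating. The data frame is a synthetic
# placeholder, not the study's citation records.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
levels = ["Low", "Medium", "High", "Very High"]
counts = [60, 130, 1600, 2000]                     # assumed citations per level
error_rate = dict(zip(levels, [0.65, 0.38, 0.05, 0.01]))

df = pd.DataFrame({"confidence": np.repeat(levels, counts)})
df["error"] = (rng.random(len(df)) < df["confidence"].map(error_rate)).astype(int)

model = smf.logit("error ~ C(confidence)", data=df).fit(disp=False)
pred = model.predict(pd.DataFrame({"confidence": levels}))
for lvl, p in zip(levels, pred):
    print(f"{lvl:>9}: predicted P(error) = {p:.2f}")
```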
Agarwal, A.; Albarqouni, L.; Badran, N.; Brax, N.; Gandhi, P.; Pereira, T.; Roberts, A.; El Zein, O.; Akl, E.
Independent systematic reviewers may arrive at different conclusions when analyzing evidence addressing the same clinical questions. Similarly, independent expert panels may arrive at different recommendations addressing the same clinical topics. When faced with a multiplicity of reviews or guidelines on a given topic, users are likely to benefit from a structured approach to evaluate concordance, and to explain discordant findings and recommendations. This protocol proposes a methodological survey to evaluate the prevalence of concordance between reviews addressing similar clinical questions, and between clinical practice guidelines addressing similar topics; and to identify methodological frameworks for the evaluation of concordance between related reviews and between related guidelines.
Pitre, T.; Jassal, T.; Talukdar, J. R.; Shahab, M.; Ling, M.; Zeraatkar, D.
Background: Internationally accepted standards for systematic reviews necessitate assessment of the risk of bias of primary studies. Assessing risk of bias, however, can be time- and resource-intensive. AI-based solutions may increase efficiency and reduce burden. Objective: To evaluate the reliability of ChatGPT for performing risk of bias assessments of randomized trials using the revised risk of bias tool for randomized trials (RoB 2.0). Methods: We sampled recently published Cochrane systematic reviews of medical interventions (up to October 2023) that included randomized controlled trials and assessed risk of bias using the Cochrane-endorsed revised risk of bias tool for randomized trials (RoB 2.0). From each eligible review, we collected data on the risk of bias assessments for the first three reported outcomes. Using ChatGPT-4, we assessed the risk of bias for the same outcomes using three different prompts: a minimal prompt including limited instructions, a maximal prompt with extensive instructions, and an optimized prompt designed to yield the best risk of bias judgements. The agreement between ChatGPT's assessments and those of Cochrane systematic reviewers was quantified using weighted kappa statistics. Results: We included 34 systematic reviews with 157 unique trials. We found the agreement between ChatGPT and systematic review authors for assessment of overall risk of bias to be 0.16 (95% CI: 0.01 to 0.3) for the maximal ChatGPT prompt, 0.17 (95% CI: 0.02 to 0.32) for the optimized prompt, and 0.11 (95% CI: -0.04 to 0.27) for the minimal prompt. For the optimized prompt, agreement ranged from 0.11 (95% CI: -0.11 to 0.33) to 0.29 (95% CI: 0.14 to 0.44) across risk of bias domains, with the lowest agreement for the deviations from the intended intervention domain and the highest agreement for the missing outcome data domain. Conclusion: Our results suggest that ChatGPT and systematic reviewers have only "slight" to "fair" agreement in risk of bias judgements for randomized trials. ChatGPT is currently unable to reliably assess risk of bias of randomized trials. We advise against using ChatGPT to perform risk of bias assessments. There may be opportunities to use ChatGPT to streamline other aspects of systematic reviews, such as screening of search records or collection of data.
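Weighted kappa, the agreement measure used here, is available in scikit-learn. The sketch below compares illustrative ChatGPT-style and reviewer risk-of-bias judgements on the three-level RoB 2.0 scale; both rating vectors are made up.

```python
# Minimal sketch of a weighted-kappa agreement statistic between ChatGPT-style
# and reviewer risk-of-bias judgements. The rating vectors are illustrative.
from sklearn.metrics import cohen_kappa_score

reviewer = ["low", "some concerns", "high", "low", "some concerns", "high", "low"]
chatgpt  = ["low", "low", "high", "some concerns", "some concerns", "some concerns", "low"]

order = ["low", "some concerns", "high"]   # ordinal scale used for the weights
kappa = cohen_kappa_score(reviewer, chatgpt, labels=order, weights="linear")
print(f"linearly weighted kappa = {kappa:.2f}")
```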
Gorelik, A. J.; Gorelik, M. G.; Ridout, K. K.; Nimarko, A. F.; Peisch, V.; Kuramkote, S. R.; Low, M.; Pan, T.; Singh, S.; Nrusimha, A.; Singh, M. K.
The rapidly burgeoning quantity and complexity of publications make curating and synthesizing information for meta-analyses ever more challenging. Meta-analyses require manual review of abstracts for study inclusion, which is time-consuming, and variation among reviewers in interpreting inclusion/exclusion criteria for selecting a paper can affect a study's outcome. To address these challenges in efficiency and accuracy, we propose and evaluate a machine learning approach that captures the definition of inclusion/exclusion criteria in a machine learning model to automate the selection process. We trained machine learning models on a manually reviewed dataset from a meta-analysis of resilience factors influencing psychopathology development. The trained models were then applied to an oncology dataset and evaluated for efficiency and accuracy against trained human reviewers. The results suggest that machine learning models can be used to automate the paper selection process and reduce the abstract review time while maintaining accuracy comparable to trained human reviewers. We propose a novel approach that uses model confidence to propose a subset of abstracts for manual review, thereby increasing the accuracy of the automated review while reducing the total number of abstracts requiring manual review. Furthermore, we delineate how leveraging these models more broadly may facilitate the sharing and synthesis of research expertise across disciplines.
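The workflow described, training a classifier on labelled abstracts and routing low-confidence predictions to manual review, can be sketched as follows. The toy abstracts, labels, and the 0.8 confidence threshold are all assumptions for illustration.

```python
# Minimal sketch: train a text classifier on labelled abstracts, then route
# only low-confidence predictions to manual review. Data and the threshold
# are illustrative assumptions, not the study's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_abstracts = [
    "randomized trial of resilience intervention in adolescents",
    "protein folding simulation in yeast",
    "cohort study of childhood adversity and later psychopathology",
    "novel catalyst for ammonia synthesis",
]
train_labels = [1, 0, 1, 0]          # 1 = meets inclusion criteria

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_abstracts, train_labels)

new_abstracts = [
    "longitudinal study of resilience factors after trauma",
    "graphene-based battery electrodes",
]
proba = clf.predict_proba(new_abstracts)[:, 1]
for text, p in zip(new_abstracts, proba):
    confidence = max(p, 1 - p)
    route = "auto-decide" if confidence >= 0.8 else "manual review"
    print(f"{p:.2f} include-probability -> {route}: {text[:40]}")
```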
Yi, Y.; Lin, A.; Zhou, C.; Zhang, J.; Wang, S.; Luo, P.
Meta-analysis is a common statistical method used to summarize multiple studies that cover the same topic. It can provide less biased results and explain heterogeneity between studies. Although a variety of meta-analysis software packages exist, few are both convenient to use and comprehensive in their analytical functions. We were therefore motivated to develop a meta-analysis web tool called Meta-Analysis Online (Onlinemeta). Onlinemeta includes three major modules: risk-of-bias analysis, meta-analysis, and network meta-analysis. The risk-of-bias analysis module can produce heatmaps and histograms, whereas the meta-analysis module accepts many types of data as input, including dichotomous variables, single-armed dichotomous variables, continuous variables, single-armed continuous variables, survival data, the deft method, and diagnostic experiments, and outputs well-tuned forest plots, sensitivity-analysis forest plots, funnel plots, comparison tables of effects, SROC curves, and crosshair plots. The network meta-analysis module can process dichotomous or continuous variables and generate network plots, forest plots, SUCRA (Surface Under the Cumulative Ranking) plots, rank plots, and heatmaps. Onlinemeta is available at https://smuonco.Shinyapps.io/Onlinemeta/.
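As a reminder of the kind of pooling such a tool performs, the sketch below runs a DerSimonian-Laird random-effects meta-analysis on log odds ratios by hand; the three studies are invented and this is not Onlinemeta's implementation.

```python
# Minimal sketch of an inverse-variance random-effects meta-analysis
# (DerSimonian-Laird) on log odds ratios. The studies are illustrative.
import numpy as np

yi = np.array([0.41, 0.18, 0.65])     # per-study log odds ratios
vi = np.array([0.05, 0.02, 0.09])     # per-study variances

w_fixed = 1 / vi
y_fixed = np.sum(w_fixed * yi) / np.sum(w_fixed)

# DerSimonian-Laird between-study variance (tau^2)
Q = np.sum(w_fixed * (yi - y_fixed) ** 2)
df = len(yi) - 1
C = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
tau2 = max(0.0, (Q - df) / C)

w_re = 1 / (vi + tau2)
y_re = np.sum(w_re * yi) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
lo, hi = y_re - 1.96 * se_re, y_re + 1.96 * se_re

print(f"tau^2 = {tau2:.3f}")
print(f"pooled OR = {np.exp(y_re):.2f} (95% CI {np.exp(lo):.2f}-{np.exp(hi):.2f})")
```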
Bastian, H.; Hemkens, L. G.
Background: From 2006 to 2019, Cochrane reviews could be designated "stable" if they were not being updated but were highly likely to be current. This provides an opportunity to observe practice in ending systematic reviewing and what is regarded as enough evidence. Methods: We identified Cochrane reviews designated stable in 2013 and 2019 and the reasons for this designation. For those with conclusions stated to be so firm that new evidence is unlikely to change them, we assessed conclusions, strength-of-evidence ratings, and recommendations for further research. We assessed the fate of the 2013 stable reviews. We also estimated the usage of formal analytic methods for determining when there is enough evidence in protocols for Cochrane reviews. Results: Cochrane reviews were rarely designated stable. In 2019, there were 507 stable Cochrane reviews (6.6% of 7,645 non-withdrawn reviews). The most common reasons related to no, little, or infrequent research activity being expected (331 of 505; 65.5%). Only 39 reviews were stable because of firm conclusions unlikely to be changed by new evidence (7.7%), but that declaration was mostly not supported by judgments made in the review about strength of evidence and implications for research. Among the 180 reviews stable in 2013, 16 reverted to normal status (8.9%), with 2 of those changing conclusions because of new studies. Few Cochrane protocols specified an analytic method for determining when there was enough evidence to stop updating the review (116 of 2,415; 4.8%). Conclusion: Cochrane reviews were more likely to end because important future primary research activity was believed to be unlikely than because there was enough evidence. Judgments about the strength of evidence and the need for research were often inconsistent with the declaration that conclusions were unlikely to change. The inconsistencies underscore the need for reliable analytic methods to support decision-making about the conclusiveness of evidence.
Bastian, H.; Doust, J.; Clarke, M.; Glasziou, P.
Background: The Cochrane Collaboration has been publishing systematic reviews in the Cochrane Database of Systematic Reviews (CDSR) since 1995, with the intention that these be updated periodically. Objectives: To chart the long-term updating history of a cohort of Cochrane reviews and the impact on the number of included studies. Methods: The status of a cohort of Cochrane reviews updated in 2003 was assessed at three time points: 2003, 2011, and 2018. We assessed their subject scope, compiled their publication history using PubMed and the CDSR, and compared them to all Cochrane reviews available in 2002 and 2017/18. Results: Of the 1,532 Cochrane reviews available in 2002, 11.3% were updated in 2003, with 16.6% not updated between 2003 and 2011. The reviews updated in 2003 were not markedly different from other reviews available in 2002, but more were retracted or declared stable by 2011 (13.3% versus 6.3%). The 2003 update led to a major change in the conclusions of 2.8% of updated reviews (n = 177). The cohort had a median time since publication of the first full version of the review of 18 years and a median of three updates by 2018 (range 1-11). The median time to update was three years (range 0-14 years). By the end of 2018, the median time since the last update was seven years (range 0-15). The median number of included studies rose from eight in the version of the review before the 2003 update, to 10 in that update, and to 14 in 2018 (range 0-347). Conclusions: Most Cochrane reviews get updated; however, they are becoming more out of date over time. Updates have resulted in an overall rise in the number of included studies, although they only rarely lead to major changes in conclusions.
Xu, C.; Furuya-Kanamori, L.; Lin, L.; Doi, S. A. R.
In this study, we examined the discrepancy between large studies and small studies in meta-analyses of rare event outcomes and the impact of Peto versus classic odds ratios (ORs), using empirical data from the Cochrane Database of Systematic Reviews collected from January 2003 to May 2018. Meta-analyses of binary outcomes with rare events (event rate ≤5%), with at least 5 studies, and with at least one large study (N≥1000) were extracted. The Peto and classic ORs were used as the effect sizes in the meta-analyses, and the magnitude and direction of the ORs of the meta-analyses of large studies versus small studies were compared. The p-values of the meta-analyses of small studies were examined to assess whether the Peto and classic OR methods gave similar results. In total, 214 meta-analyses were included. Across the 214 pairs of pooled ORs of large studies versus pooled small studies, 66 (30.84%) had a discordant direction (kappa = 0.33) when measured by the Peto OR and 69 (32.24%) had a discordant direction (kappa = 0.22) when measured by the classic OR. The Peto ORs resulted in smaller p-values than classic ORs in a substantial proportion (83.18%) of cases. In conclusion, there is considerable discrepancy between the results of large studies and small studies in meta-analyses of sparse data. The use of Peto odds ratios does not improve this situation and is not recommended, as it may result in less conservative error estimation.
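The Peto and classic odds ratios compared in this study can be contrasted on a single sparse 2x2 table. The sketch below uses standard textbook formulas; the event counts are illustrative, not drawn from the Cochrane sample.

```python
# Minimal sketch contrasting the Peto and classic odds ratios on a sparse 2x2
# table (events a/n1 in the treatment arm, c/n2 in the control arm).
import math

a, n1 = 1, 600     # treatment: 1 event in 600 (illustrative)
c, n2 = 5, 400     # control:   5 events in 400 (illustrative)
b, d = n1 - a, n2 - c
N, m = n1 + n2, a + c

# Peto OR: log(OR) = (O - E) / V with hypergeometric expectation and variance
E = n1 * m / N
V = n1 * n2 * m * (N - m) / (N ** 2 * (N - 1))
peto_or = math.exp((a - E) / V)

# Classic OR (with a 0.5 continuity correction if any cell were zero)
cc = 0.5 if 0 in (a, b, c, d) else 0.0
classic_or = ((a + cc) * (d + cc)) / ((b + cc) * (c + cc))

print(f"Peto OR = {peto_or:.2f}, classic OR = {classic_or:.2f}")
```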
Otte, W. M.; van IJzendoorn, D. G.; Habets, P. C.; Vinkers, C. H.
The synthesis of treatment effects relies on systematic reviews of intervention trials. This process is often laborious due to the need for precise search queries and manual study identification. Recent advancements in database architecture and natural language processing (NLP) offer a potential solution by enabling faster, more flexible searches and automated extraction of information from unstructured texts. Our study assesses the effectiveness of NLP-based literature searches within a novel database structure in comparison to the Cochrane Database of Systematic Reviews. We created a user-friendly elastic search database containing 36 million PubMed-indexed entries. We developed reliable filters for identifying randomized clinical trials and clinical intervention studies, as well as extracting relevant subtext related to population and intervention. Our results indicate a high precision of 0.74, recall of 0.81, and F1-score of 0.77 for population subtext, and a precision of 0.70, recall of 0.71, and an F1-score of 0.70 for intervention subtext. Our approach efficiently identified included studies in 90% of systematic reviews, missing no more than two trials compared to Cochrane. Furthermore, it produced fewer total hits than a comparable PubMed keyword search, demonstrating the potential of the new database structure to enhance the efficiency and effectiveness of aggregating clinical evidence.
Veroniki, A.-A.; Wolfe, D.; Hutton, B.; Schwarzer, G.; McIssac, D. I.; Straus, S. E.; Jackson, D.; Tricco, A. C.
Standfirst: Systematic reviews with network meta-analysis (NMA) frequently evaluate complex interventions combining multiple healthcare interventions (known as components). Components may act independently of each other or in conjunction with other components, synergistically or antagonistically. Component effect estimation is crucial to produce relevant and clinically meaningful evidence. However, standard NMA cannot quantify the individual component effects of complex interventions. This study presents methods for modeling complex interventions and highlights the advantages and limitations of component NMA (CNMA). CNMA enables the estimation of individual component effects, whether additive or interactive. Interaction CNMA can be considered an extension of the additive CNMA model that includes interaction terms. We give practical guidance on how to carry out these analyses via empirical examples, which showcase both the strengths and limitations of CNMA. Implementing CNMA models is complex and requires the skills of a multidisciplinary team including clinicians, methodologists, and statisticians. Summary points:
- CNMA provides the opportunity to disentangle the effects of components of complex interventions and assess their efficacy or safety, accounting for potential interactions in component combinations.
- Under the additivity assumption, CNMA assumes that the total effect of a complex intervention is the sum of its individual component effects (e.g., if component A lowers a symptom score by 2 points and B by 1 point, A+B is expected to lower it by 3 points), while interaction CNMA is used when there is evidence of violation of additivity and the combined effect differs due to synergy or antagonism.
- Interaction CNMA can be considered a compromise between additive CNMA and standard NMA, but selecting interaction terms requires clinical, statistical, and methodological considerations.
- Clinicians and other knowledge users should be engaged in the selection of interaction CNMA models to ensure biological plausibility.
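The additivity assumption described in the summary points can be made concrete with a small worked example: each treatment is a row of a component design matrix, and combination effects are sums of component effects. The numbers below are illustrative, not from the paper's empirical examples.

```python
# Minimal sketch of the additivity assumption in component NMA: each treatment
# is a row of a component design matrix and the relative effect of a
# combination is the sum of its component effects. Numbers are illustrative.
import numpy as np

components = ["A", "B", "C"]
# Treatment definitions relative to placebo, as component indicator rows
X = np.array([
    [1, 0, 0],   # A
    [0, 1, 0],   # B
    [1, 1, 0],   # A + B
    [1, 0, 1],   # A + C
])
# Observed trial effects vs placebo (e.g. mean differences) and their variances
y = np.array([-2.1, -0.9, -3.2, -2.8])
v = np.array([0.10, 0.08, 0.15, 0.12])

# Weighted least squares estimate of the component effects d = (A, B, C)
W = np.diag(1 / v)
d = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
for name, est in zip(components, d):
    print(f"component {name}: {est:+.2f}")

# Under additivity, the predicted effect of A+B is simply d_A + d_B
print(f"predicted A+B effect: {d[0] + d[1]:+.2f}")
```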
Tran, V.-T.; Gartlehner, G.; Yaacoub, S.; Boutron, I.; Schwingshackl, L.; Stadelmaier, J.; Sommer, I.; Aboulayeh, F.; Afach, S.; Meerpohl, J.; Ravaud, P.
Importance: Systematic reviews are time-consuming and are still performed predominately manually by researchers despite the exponential growth of scientific literature. Objective: To investigate the sensitivity and specificity, and to estimate the avoidable workload, when using an AI-based large language model (LLM) (Generative Pre-trained Transformer [GPT] version 3.5-Turbo from OpenAI) to perform title and abstract screening in systematic reviews. Data Sources: Unannotated bibliographic databases from five systematic reviews conducted by researchers from Cochrane Austria, Germany and France, all published after January 2022 and hence not in the training data set of GPT 3.5-Turbo. Design: We developed a set of prompts for GPT models aimed at mimicking the process of title and abstract screening by human researchers. We compared recommendations from the LLM to rule out citations based on title and abstract with decisions from authors, with a systematic reappraisal of all discrepancies between the LLM and their original decisions. We used bivariate models for meta-analyses of diagnostic accuracy to estimate pooled sensitivity and specificity. We performed a simulation to assess the avoidable workload from limiting human screening of titles and abstracts to citations that were not "ruled out" by the LLM in a random sample of 100 systematic reviews published between 01/07/2022 and 31/12/2022. We extrapolated estimates of avoidable workload for health-related systematic reviews assessing therapeutic interventions in humans published per year. Results: The performance of GPT models was tested across 22,666 citations. Pooled estimates of sensitivity and specificity were 97.1% (95% CI 89.6% to 99.2%) and 37.7% (95% CI 18.4% to 61.9%), respectively. In 2022, we estimated the workload of title and abstract screening for systematic reviews to range from 211,013 to 422,025 person-hours. Limiting human screening to citations that were not "ruled out" by GPT models could reduce workload by 65% and save from 106,268 to 276,053 person-hours of work (i.e., 66 to 172 person-years of work) every year. Conclusions and Relevance: AI systems based on large language models provide highly sensitive and moderately specific recommendations to rule out citations during title and abstract screening in systematic reviews. Their use to "triage" citations before human assessment could reduce the workload of evidence synthesis.
Lee, D. C. W.; O'Brien, K. M.; Presseau, J.; Yoong, S.; Lecathelinais, C.; Wolfenden, L.; Thomas, J.; Arno, A.; Hutton, B.; Hodder, R. K.
Background: Systematic reviews are important for informing public health policies and program selection; however, they are time- and resource-intensive. Artificial intelligence (AI) offers a solution to reduce these labour-intensive requirements for various aspects of systematic review production, including data extraction. To date, there is limited robust evidence evaluating the accuracy and efficiency of AI for data extraction. This study within a review (SWAR) aimed to determine whether human data extraction assisted by an AI research assistant (Elicit®) is noninferior to human-only data extraction in terms of accuracy (i.e. agreement) and time-to-completion. Secondary aims included comparing error types and costs. Methods: A two-arm noninferiority SWAR was conducted to compare AI-assisted and human-only data extraction from 50 RCTs of chronic disease interventions. Participants were randomised to extract all data required for conducting a review, using either the AI-assisted or human-only method. Accuracy was assessed using a three-point rubric by an independent assessor blinded to group allocation, based on agreement between the extracted data and the assessor. Accuracy scores were standardized to a 0-100 scale. Analysis included overall and subgroup accuracy (by data group and data type) using paired t-tests. Time-to-completion was self-reported by data extractors. Errors were coded by type and severity, and costs were calculated for data extraction, preparation of files, training, and the Elicit® Pro subscription. Results: There was no difference in overall accuracy between the AI-assisted and human-only arms (mean difference (MD) 0.57 on a 0-100 scale; 95% confidence interval (CI) -1.29 to 2.43). Subgroup analysis by data group found AI-assisted extraction to be more accurate than human-only extraction for data variables describing the intervention and control groups (MD 4.75, 95% CI 2.13 to 7.38), but otherwise no subgroup differences were observed. AI-assisted data extraction was significantly faster (MD 24.82 mins, 95% CI 18.80 to 30.84). The AI-assisted arm made similar types of errors (missed or omitted data: AI-assisted 3.6%, human-only 3.4%) with similar severity (minor errors: AI-assisted 6.7%, human-only 6.5%) and cost $181.98 less than human-only data extraction across the 50 studies. Conclusion: AI-assisted data extraction using Elicit® showed noninferior accuracy, faster completion times, similar error types and severity, and lower costs compared to human-only extraction. These efficiency gains, without loss of accuracy, suggest AI-assisted data extraction can replace one human-only data extractor in future systematic reviews of RCTs. Future research should explore different models of AI data extraction, such as two AI-assisted extractors or an AI-only extractor paired with a human-only extractor, and comparison of AI-assisted to AI-only extraction.
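The noninferiority comparison described here boils down to a paired analysis of accuracy scores judged against a margin. The sketch below shows that logic on invented scores with an assumed 5-point margin; it is not the SWAR's analysis code.

```python
# Minimal sketch of a noninferiority check: a paired analysis of standardized
# accuracy scores (0-100) for AI-assisted vs human-only extraction, judged
# against an assumed margin. Scores and the 5-point margin are illustrative.
import numpy as np
from scipy import stats

ai_assisted = np.array([92, 88, 95, 90, 85, 93, 91, 89])
human_only  = np.array([91, 90, 94, 88, 86, 92, 90, 90])
margin = -5.0                       # AI-assisted may be at most 5 points worse

diff = ai_assisted - human_only
t, p = stats.ttest_rel(ai_assisted, human_only)
se = diff.std(ddof=1) / np.sqrt(len(diff))
ci_lo = diff.mean() - stats.t.ppf(0.975, len(diff) - 1) * se
ci_hi = diff.mean() + stats.t.ppf(0.975, len(diff) - 1) * se

print(f"mean difference = {diff.mean():.2f} "
      f"(95% CI {ci_lo:.2f} to {ci_hi:.2f}), paired t-test p = {p:.3f}")
print("noninferior" if ci_lo > margin else "not noninferior")
```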